Mixture-of-Depths: Dynamically allocating compute in transformer-based language models

Raposo, David; Ritter, Sam; Richards, Blake; Lillicrap, Timothy; Humphreys, Peter Conway; Santoro, Adam

arXiv:2404.02258 (cs)

[Submitted on 2 Apr 2024]

Abstract: Transformer-based language models spread FLOPs uniformly across input sequences. In this work we demonstrate that transformers can instead learn to dynamically allocate FLOPs (or compute) to specific positions in a sequence, optimising the allocation along the sequence for different layers across the model depth. Our method enforces a total compute budget by capping the number of tokens ($k$) that can participate in the self-attention and MLP computations at a given layer. The tokens to be processed are determined by the network using a top-$k$ routing mechanism. Since $k$ is defined a priori, this simple procedure uses a static computation graph with known tensor sizes, unlike other conditional computation techniques. Nevertheless, since the identities of the $k$ tokens are fluid, this method can expend FLOPs non-uniformly across the time and model depth dimensions. Thus, compute expenditure is entirely predictable in sum total, but dynamic and context-sensitive at the token level. Not only do models trained in this way learn to dynamically allocate compute, they do so efficiently. These models match baseline performance for equivalent FLOPs and wall-clock times to train, but require a fraction of the FLOPs per forward pass, and can be upwards of 50% faster to step during post-training sampling.
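
The top-$k$ routing described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the names `route_top_k`, `router_w`, and `toy_block` are hypothetical, the router is assumed to be a single linear projection producing one scalar score per token, and a `tanh` stands in for the self-attention and MLP computations. The sketch only shows how a fixed $k$ keeps tensor shapes static while the identities of the processed tokens vary with the input.

```python
import numpy as np


def route_top_k(x, router_w, block_fn, k):
    """Apply block_fn to the k highest-scoring tokens; the rest skip the block.

    x:        (seq_len, d_model) token representations
    router_w: (d_model,) weights of an assumed linear router
    block_fn: stand-in for the layer's attention + MLP computation
    k:        number of tokens allowed through the block (fixed a priori)
    """
    scores = x @ router_w                       # one scalar score per token
    # Indices of the k largest scores, re-sorted into sequence order.
    top_idx = np.sort(np.argsort(scores)[-k:])
    out = x.copy()                              # skipped tokens pass through unchanged
    # Only a (k, d_model) slice is processed, so the cost is fixed by k.
    out[top_idx] = x[top_idx] + block_fn(x[top_idx])
    return out, top_idx


# Toy usage with assumed sizes: 8 tokens, 4 features, budget of 3 tokens.
rng = np.random.default_rng(0)
seq_len, d_model, k = 8, 4, 3
x = rng.normal(size=(seq_len, d_model))
router_w = rng.normal(size=d_model)
toy_block = lambda h: np.tanh(h)                # hypothetical placeholder block

out, top_idx = route_top_k(x, router_w, toy_block, k)
skipped = np.setdiff1d(np.arange(seq_len), top_idx)
```

Here `top_idx` depends on the input, but its length is always `k`, so the slice passed to `toy_block` has a known shape regardless of which tokens are chosen. The sketch leaves out training: because the selected indices are not differentiable, a trainable version would need the router scores to enter the output in some way.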

Subjects:	Machine Learning (cs.LG); Computation and Language (cs.CL)
Cite as:	arXiv:2404.02258 [cs.LG]
	(or arXiv:2404.02258v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2404.02258

Submission history

From: Adam Santoro
[v1] Tue, 2 Apr 2024 19:28:11 UTC (1,763 KB)
